Tree Ensembles
Readings: ISLR Section 8.2.1, 8.2.3, Code
Bullet items marked with ^{\dagger} are not in the reading and will not be tested.
Setup
Goal. Predict continuous or categorical y_i from N training examples \{\left({x_i, y_i}\right)\}_{i = 1}^{N}, prioritizing prediction performance over interpretability.
Requirements.
- Though we no longer require interpretable tree structure, we still want to report which features are most important.
- As with CART, we must handle continuous and categorical features, along with potential interactions.
- All hyperparameters must be simple to tune. To get better performance, we want more flexible models, but not at the cost of ease-of-use.
Approach.
We’ll study methods that combine many decision trees into a single predictor.
Bagging: Use the bootstrap to resample the original training examples, and train a separate decision tree on each resample. Predict by averaging (regression) or majority vote (classification) across those trees.
Boosting: Fit tree models sequentially. Each tree fixes the mistakes of those that came before.
These methods have a reputation for working well “out of the box.” They are easy to set up and tune, yet often give high performance, which makes them popular in industry and in ML competitions.
Weaknesses of CART
There are three properties that limit CART’s prediction performance: instability when predictors are correlated, the constraint of axis-aligned splits, and inefficiency in representing smoothly varying functions.
Instability. When features x_{j} and x_{j'} are correlated, either could be plausibly chosen for a split. If this ambiguity occurs early in the tree growing process, the entire downstream structure changes. The result is high variance in the fitted surface across independent training datasets, which degrades test set performance.
Axis-aligned splits and smooth functions. The decision tree partitions are axis-aligned rectangles. When we approximate smooth decision boundaries (classification) or functions (regression), we need many splits, resulting in deep trees with few training examples per leaf. This increases variance and risks overfitting.
Bagging
A basic fact in statistics is that, if x_1, \dots, x_N are independent with variance \sigma^2, then the average \bar{x}_{N} := \frac{1}{N}\sum_{n = 1}^{N} x_n has variance \frac{\sigma^2}{N}. Averaging independent observations reduces variance. Bagging adapts this idea to decision trees.
Draw b = 1, \dots, B bootstrap samples from the training data, \begin{align*} \mathcal{D}^{(b)} := \{(x_i^{(b)}, y_i^{(b)})\}_{i = 1}^{N} \end{align*} and fit a separate decision tree to each, yielding the collection \{\hat{f}^{b}\}_{b = 1}^{B}.
Exercise: Consider an arbitrary sample i. In what fraction of bootstrap samples would you expect it to appear?
For regression, predict by averaging across trees, \begin{align*} \hat{f}_{\text{avg}}(x) &= \frac{1}{B}\sum_{b = 1}^{B}\hat{f}^b(x) \end{align*} For classification, take a majority vote. This procedure, bootstrap aggregation, is called bagging.
Because averaging already reduces variance, pruning individual trees is no longer necessary. Most bagging implementations skip it.
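To make the procedure concrete, here is a minimal hand-rolled sketch of bagging with `rpart` trees. The simulated data, the quadratic signal, and B = 25 are illustrative choices, not from the reading.

```r
library(rpart)

set.seed(1)
N <- 200
df <- data.frame(x = runif(N, -2, 2))
df$y <- df$x^2 + rnorm(N, sd = 0.3)  # illustrative regression problem

B <- 25
trees <- vector("list", B)
for (b in 1:B) {
  idx <- sample(N, N, replace = TRUE)           # bootstrap resample
  trees[[b]] <- rpart(y ~ x, data = df[idx, ])  # unpruned tree on the resample
}

# f_avg(x) = (1/B) * sum_b f^b(x)
predict_bagged <- function(newdata) {
  rowMeans(sapply(trees, predict, newdata = newdata))
}
p <- predict_bagged(data.frame(x = c(-1, 0, 1)))
```

Packages like `ipred` or `randomForest` wrap this same loop, but the entire method really is just resample, fit, average.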
Out-of-bag error evaluation. One nice property of bagging is that it provides free generalization estimates, without a held-out test set or cross-validation. For sample i, let \begin{align*} B^{(-i)} := \{b : (x_i, y_i) \notin \mathcal{D}^{(b)}\} \end{align*} be the set of bootstrap samples that didn’t include observation i. Predict for x_i using only those trees, \begin{align*} \hat{f}_{\text{OOB}}(x_i) &= \frac{1}{|B^{(-i)}|} \sum_{b \in B^{(-i)}} \hat{f}^b(x_i) \end{align*} Since \hat{f}_{\text{OOB}}(x_i) was never trained on sample i, it gives an honest sense of generalization performance. Averaging across samples gives the out-of-bag error estimate, \begin{align*} \text{MSE}_{\text{OOB}} = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{f}_{\text{OOB}}(x_i)\right)^2. \end{align*}
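The OOB estimate only requires bookkeeping of which observations each bootstrap sample drew. A minimal sketch, on illustrative simulated data:

```r
library(rpart)

set.seed(1)
N <- 200
df <- data.frame(x = runif(N, -2, 2))
df$y <- df$x^2 + rnorm(N, sd = 0.3)  # illustrative regression problem

B <- 50
trees <- vector("list", B)
in_bag <- matrix(FALSE, nrow = N, ncol = B)  # in_bag[i, b]: did resample b draw i?
for (b in 1:B) {
  idx <- sample(N, N, replace = TRUE)
  in_bag[unique(idx), b] <- TRUE
  trees[[b]] <- rpart(y ~ x, data = df[idx, ])
}

# OOB prediction for sample i: average only over the trees that never saw i
all_preds <- sapply(trees, predict, newdata = df)  # N x B matrix
oob_pred <- sapply(1:N, function(i) mean(all_preds[i, !in_bag[i, ]]))
mse_oob <- mean((df$y - oob_pred)^2)
```

With B = 50, each sample is out-of-bag for roughly a third of the trees, so every OOB prediction averages a reasonable number of trees.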
Exercise: How would you extend OOB error estimation to the classification case?
Variable Importance
Averaging the trees \hat{f}^{b} gives up the interpretability present in each individual tree. We trade simulatability for prediction performance.
We can recover some interpretability through variable importance. For variable j, define, \begin{align*} I_j = \sum_{b=1}^{B} \sum_{\text{splits on } j \text{ in tree } b} \Delta(l, j, s) \end{align*} where \Delta(l, j, s) is the RSS reduction (or Gini/Entropy reduction in classification) from splitting the associated node R_l on variable j at threshold s, as defined in the CART notes. This captures two characteristics of important variables:
- Variables that are split many times across the ensemble \{\hat{f}^b\} contribute more terms to the sum
- Variables whose splits lead to large \Delta values (large decreases in RSS or impurity) get higher scores
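As a sketch, we can accumulate per-tree split improvements across a hand-rolled bagged ensemble. One caveat: `rpart`'s built-in `variable.importance` also credits surrogate splits, so it approximates, rather than exactly matches, the I_j above. The data-generating process here is illustrative.

```r
library(rpart)

set.seed(1)
N <- 300
df <- data.frame(x1 = runif(N), x2 = runif(N), x3 = runif(N))
df$y <- 5 * df$x1 + rnorm(N)  # only x1 carries signal (illustrative)

B <- 25
importance <- c(x1 = 0, x2 = 0, x3 = 0)
for (b in 1:B) {
  idx <- sample(N, N, replace = TRUE)
  tree <- rpart(y ~ ., data = df[idx, ])
  vi <- tree$variable.importance  # per-tree sums of split improvements
  if (!is.null(vi)) importance[names(vi)] <- importance[names(vi)] + vi
}
ranked <- sort(importance, decreasing = TRUE)
```

Here the informative variable x1 should dominate the ranking, while x2 and x3 pick up only incidental splits.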
Exercise: TRUE FALSE The variable j can have high importance I_j even if it was never a root split in any tree.
Boosting Algorithm
Bagging fits the trees \hat{f}^{b} independently of one another, so they cannot address one another’s weaknesses. For example, suppose the true decision boundary is the diagonal x_1 = x_2 on the unit square, and consider bagging 100 depth-1 trees. With one split, the best any single tree can do is split one of the variables around 0.5. No amount of averaging will recover the diagonal.
Boosting solves this problem by training trees sequentially. Each new tree fits the residuals of the current ensemble. Even simple single-split trees, accumulated over many steps, can approximate the diagonal decision boundary.
1D Example. Consider learning a regression on these sinusoidal data.
```r
library(rpart)

X <- runif(100, -5, 5)
y <- 3 * sin(X) + 0.5 * rnorm(100)
df <- data.frame(X = X, y = y)
plot(X, y)
```

First, fit a shallow depth-2 tree. We scale its predictions by a learning rate `lambda`. This shrinkage prevents any single tree from dominating the ensemble.

```r
lambda <- 0.9
tree_reg1 <- rpart(y ~ X, data = df, control = rpart.control(maxdepth = 2))
df$pred1 <- lambda * predict(tree_reg1, df)
```

Compute residuals from the shrunk prediction and fit a new tree to them.

```r
df$resid1 <- df$y - df$pred1
tree_reg2 <- rpart(resid1 ~ X, data = df, control = rpart.control(maxdepth = 2))
df$pred2 <- lambda * predict(tree_reg2, df)
```

Repeat.

```r
df$resid2 <- df$resid1 - df$pred2
tree_reg3 <- rpart(resid2 ~ X, data = df, control = rpart.control(maxdepth = 2))
df$pred3 <- lambda * predict(tree_reg3, df)
```

The block below visualizes the sequence. Each component tree is simple, but their sum approximates the target well.
Exercise: What would happen if we decreased \lambda to 0.1?
General algorithm.
- Initialize \hat{f}(x) = 0 and r_i = y_i for all i
- For b = 1, 2, \ldots, B:
  - Fit a depth-D tree \hat{f}^b to the residual data \{x_i, r_i\}_{i = 1}^{N}
  - Update the model: \hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)
  - Update the residuals: r_i \leftarrow y_i - \hat{f}(x_i)
- Output: \hat{f}(x) = \sum_{b=1}^B \lambda \hat{f}^b(x)
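The steps above translate directly into a short function. This is a minimal sketch using `rpart` as the base learner; the defaults B = 100, D = 2, \lambda = 0.1 and the sinusoidal test data are illustrative choices, not prescriptions.

```r
library(rpart)

# Boosting loop: each tree fits the current residuals; predictions are shrunk by lambda.
boost_trees <- function(X, y, B = 100, D = 2, lambda = 0.1) {
  df <- as.data.frame(X)
  resid <- y  # initialize f-hat = 0, so r_i = y_i
  trees <- vector("list", B)
  for (b in 1:B) {
    df$r <- resid
    trees[[b]] <- rpart(r ~ ., data = df,
                        control = rpart.control(maxdepth = D))
    resid <- resid - lambda * predict(trees[[b]], newdata = df)  # r_i <- y_i - f-hat(x_i)
  }
  trees
}

# f-hat(x) = sum_b lambda * f^b(x); lambda must match the training value
predict_boosted <- function(trees, newdata, lambda = 0.1) {
  lambda * rowSums(sapply(trees, predict, newdata = newdata))
}

set.seed(1)
X <- data.frame(x = runif(200, -5, 5))
y <- 3 * sin(X$x) + 0.5 * rnorm(200)
fit <- boost_trees(X, y)
yhat <- predict_boosted(fit, X)
```

Production implementations like `xgboost` add many refinements (second-order objectives, regularization, subsampling), but this loop is the core algorithm.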
The algorithm has the following hyperparameters:
- Number of trees B: More trees increases flexibility but risks overfitting.
- Shrinkage (learning rate) \lambda: Smaller values stabilize learning and often improve the final fit, but at the cost of requiring more trees (\to high memory).
Exercise: TRUE FALSE In boosting, the trees are trained on bootstrap resampled versions of the original data.
Another hyperparameter is the maximum depth D, which controls the degree of learnable interactions. When D = 1, each tree \hat{f}^{b} splits a single variable, so the boosted model, \begin{align*} \hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x), \end{align*} has the form \hat{f}(x) = \sum_{j=1}^{J} g_j(x_j), an additive model without interactions. More generally, a depth-D tree involves at most D variables on any root-to-leaf path, so the boosted model can only capture interactions of up to D variables.
Exercise: Imagine working at a company with both very large data and a strong parallel computing infrastructure. Can the trees \hat{f}^{b} in boosting be trained in parallel? What about the \hat{f}^{b} in bagging?
Partial Dependence Plots
\dagger For a sample \mathbf{x} \in \mathbb{R}^J, let \mathbf{x}^{j|=z} denote \mathbf{x} with its j^{\text{th}} coordinate replaced by z.
\dagger Ceteris Paribus profiles describe how the prediction changes when varying one feature while holding the rest fixed. For observation \mathbf{x} and feature j, the CP profile is: h_{\mathbf{x}}^{j}(z) := \hat{f}(\mathbf{x}^{j|=z}) This traces out a curve showing \hat{f}’s sensitivity to feature j when starting from \mathbf{x}. Different samples \mathbf{x} produce different curves h_{\mathbf{x}}^{j}, since the effect of varying x_j depends on the values of the other features.
\dagger Partial Dependence profiles average CP profiles across the training data to summarize the marginal effect of feature j: g^{j}(z) := \frac{1}{N} \sum_{i=1}^{N} \hat{f}(\mathbf{x}_i^{j|=z})
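\dagger Computing g^{j}(z) by hand is a short loop: replace the j-th column with z for every sample, then average the predictions. Below is a sketch with an `rpart` fit standing in for \hat{f}; the interaction data-generating process is illustrative.

```r
library(rpart)

set.seed(1)
N <- 300
df <- data.frame(x1 = runif(N), x2 = runif(N))
df$y <- 2 * df$x1 * df$x2 + rnorm(N, sd = 0.1)  # illustrative DGP with an interaction
fit <- rpart(y ~ x1 + x2, data = df)

# g^j(z): set coordinate j to z in every row (x^{j|=z}), then average f-hat
pd_profile <- function(fit, data, feature, z_grid) {
  sapply(z_grid, function(z) {
    data[[feature]] <- z
    mean(predict(fit, newdata = data))
  })
}
z <- seq(0, 1, length.out = 20)
g <- pd_profile(fit, df, "x1", z)
```

Each inner `mean` is an average of CP profile values h_{\mathbf{x}_i}^{j}(z) across the training samples, exactly matching the definition above.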
Code Example
We’ll reuse the data from the CART notes. The true surface has interactions between x_1 and x_2 and is not piecewise constant.
```r
library(tidyverse)

sim_data <- tibble(
  x1 = runif(500),
  x2 = runif(500),
  y = 10 * sin(pi * x1 * x2) + 20 * (x2 - 0.5)^2 + rnorm(500)
)
grid <- expand_grid(
  x1 = seq(0, 1, length.out = 100),
  x2 = seq(0, 1, length.out = 100)
) |>
  mutate(y_true = 10 * sin(pi * x1 * x2) + 20 * (x2 - 0.5)^2)
```

Bagging. We fit a bagging model with B = 500 trees. The prediction surface shows outlines of a few commonly created partition elements, consistent with the fact that the trees are not building on one another.

```r
library(ipred)

bag_fit <- bagging(y ~ x1 + x2, data = sim_data, nbagg = 500)
grid$bag_pred <- predict(bag_fit, newdata = grid)
```

Boosting. The `xgb.DMatrix` call combines features and response into a single dataset object.

```r
library(xgboost)

X_train <- as.matrix(sim_data[, c("x1", "x2")])
dtrain <- xgb.DMatrix(data = X_train, label = sim_data$y)
```

The key hyperparameters are `max_depth` (tree depth D), `eta` (learning rate \lambda), and `nrounds` (boosting iterations B). The boosting fit approximates the true surface much better; we can no longer make out any dominant partition elements.

```r
boost_fit <- xgb.train(
  params = list(
    max_depth = 2,
    eta = 0.05,
    objective = "reg:squarederror"
  ),
  data = dtrain,
  nrounds = 500,
  verbose = 0
)
X_grid <- as.matrix(grid[, c("x1", "x2")])
grid$boost_pred <- predict(boost_fit, newdata = X_grid)
```

`xgboost` computes variable importances using the I_j definitions above.

```r
xgb.importance(model = boost_fit)
```

```
   Feature      Gain     Cover Frequency
1:      x1 0.5981228 0.5561236 0.5141509
2:      x2 0.4018772 0.4438764 0.4858491
```

\dagger We can visualize PD profiles using `DALEX`. Each grey line is one sample's CP profile. The fact that they aren't all parallel shows that the boosting model has learned interactions.

```r
library(DALEX)

explainer <- DALEX::explain(boost_fit, data = X_train, y = sim_data$y, label = "xgboost")
pdp <- model_profile(explainer, variables = c("x1", "x2"))
plot(pdp, geom = "profiles")
```